perm filename VIS[0,BGB]7 blob
sn#085217 filedate 1974-02-01 generic text, type C, neo UTF8
2.0 Computer Vision Theory.
CONTENTS:
2.0 Introduction.
2.1 Vision Systems.
2.2 Vision Tasks.
2.3 The Nature of Images.
2.4 The Nature of Worlds.
2.5 Locus Solving.
2.6 Comparing.
2.7 Mobile Robot Vision.
2.8 Related Work.
2.9 Visual Consciousness.
2.10 Summary.
BOXES:
2.1 Three basic modes of vision.
2.2 Vision Mandala.
2.3 Some basic kinds of vision processors.
2.4 Table of 3-D computer vision tasks.
2.5 Alternatives to 3-D geometric modeling.
2.6 Three kinds of Locus Solving.
2.7 Chauffeur Cart Task Solution.
FIGURES:
2.1 Vision System Hierarchy
2.2 Cart Vision Mandala
2.0 Introduction.
Computer vision concerns programming a computer to do a task
that demands the use of an image forming light sensor such as a
television camera. The theory I intend to elaborate is that general
3-D vision is a continuous process of keeping an internal visual
simulator in sync with perceived images of the external reality so
that vision tasks can be done by reference to the simulator's model
rather than by reference to the original images. The word "theory",
as used here, means simply a set of statements presenting a
systematic view of a subject. Specifically, I wish to exclude the
connotations that the theory is a mathematical theory or a natural
theory. Perhaps there can be such a thing as an "artificial theory"
which extends from the philosophy thru the design of an artifact.
2.1 Vision Systems.
Vision systems mediate between images and world models; these
two extremes of a vision system are called, in the jargon, the
"bottom" and the "top" respectively. In what follows, the word
"image" will always be used only to refer to the notion of a 2-D data
structure representing a picture. (A picture is a scan rectangle
taken from the pattern of light formed by a thin lens on the nearly
flat photoelectric surface of a television camera's vidicon). A
sequence of images in time will be called a film. On the other hand,
a "world model" is a data structure which is supposed to represent
the physical world for the purposes of a task processor. The careful
definition of the phrase "world model" is one of the questions to be
pursued; in particular, my main point concerns isolating a portion
of the world model (to be called the 3-D geometric world model) and
placing it below most of the other entities that a task processor has
to deal with. The vision hierarchy, so formed, is illustrated in the
figure.
-------------------------------------------------------------
| Figure: Vision System Hierarchy. |
| |
| |
| Task Processor |
| | |
| Task world model |
| The Top →→ World Model | |
| 3D geometric model |
| | |
| The Bottom →→ 2D images |
| |
-------------------------------------------------------------
By considering ways to cross the gap between images and world
models, it can be found that a general vision system seems to have
three distinguishable basic modes of operation: recognition,
verification and description.
Recognition vision, also called pattern recognition, can be
characterized as bottom up random access. What is in the picture is
determined by extracting a set of features from the image and by
classifying them according to a statistical key.
Verification vision, also called top down or model driven
vision, involves predicting an image, followed by comparing the
predicted image and a perceived image for differences which are
expected but not yet measured.
Descriptive vision, alias bottom up or data driven vision,
involves converting the image into a different representation which
makes it possible to do the vision task. I would like to call this
kind of vision "revelation vision" at times; although I acknowledge
that the phrase "descriptive vision" is the term used by most members
of the computer vision community.
Box 2.1
---------------------------------------------------------------------
| Three Basic Modes of Vision. |
| |
| 1. Recognition Vision - feature classification. |
| (bottom up random access into existing top). |
| |
| 2. Verification Vision - Model Driven Vision. |
| (nearly pure top down vision). |
| |
| 3. Revelation (Descriptive) Vision - Data Driven Vision. |
| (nearly pure bottom up vision). |
---------------------------------------------------------------------
Now we have enough pieces to outline the bare system design.
By placing a 3-D geometric model in the gap, recognition vision can
be done on 3-D (rather than 2-D) features into the task world model;
and description vision and verification vision can be used to link
the 2-D and 3-D models in a relatively dumb, mechanical fashion.
Previous attempts to use recognition vision, to bridge directly the
large gap between 2-D images (of 3-D objects) and the task world
model, have been frustrated because the characteristic 2-D image
features of a 3-D object are very dependent on the 3-D physical
processes of occultation, rotation and illumination. It is these
processes that will have to be modeled and understood before the
features relevant to the task processor can be deduced from the
perceived images. The arrangement of these elements is diagramed
below.
Box 2.2
-----------------------------------------------------------------
| Vision Mandala: |
| Task World Model |
| ↑ |
| ↑ recognition |
| ↑ |
| 3-D geometric model |
| ↑ ↓ |
| revelation ↑ ↓ verification |
| ↑ ↓ |
| 2-D images |
| |
-----------------------------------------------------------------
I wish to call attention to the lower part of the above
mandala (a mandala being any circle-like system diagram); this
portion is the core of the 3-D geometric modeling vision system.
Depending on circumstances, the vision system should be able to run
almost entirely top-down (verification vision) or bottom-up
(revelation vision). Verification vision is all that is required in a
well known predictable environment; whereas, revelation vision is
required in a brand new (tabula rasa) or rapidly changing
environment. Thus revelation and verification form a loop,
bottom-up and top down. First, there is some kind of revelation that
forms (or selects) a 3-D model; and second, the model is verified by
testing image features predicted from the assumed model. This loop
like structure has been noted before by others; it is a form of what
Tenenbaum[71] called "accommodation"; and it is a form of what Falk[69]
called "heuristic vision"; however I favor the term "visual
feedback".
Completing the design, the images and worlds are
constructed, manipulated and compared by a variety of processors, the
topmost of which is the task processor. Since the task processor is
expected to vary with the application, it would be expedient if it
could be isolated as a user program calling only utility routines
of an appropriate vision sub-system. Immediately below the task
processor are the 3-D recognition routines and the 3-D modeling
routines. The modeling routines underlie most everything; because
they are used to create, alter and access the models.
The remaining processors include the reality simulator which
does Newtonian mechanics for modeling motion, collision and gravity.
Also there are image analyzers, which do image enhancement and
conversions such as converting video rasters into line drawings.
There is an image synthesizer, which does hidden line and surface
elimination, for verification by comparing synthetic images from the
model with perceived images of reality. There are three kinds of
locus solvers that compute numerical descriptions for cameras, light
sources and physical objects. Finally, there are of course a large
number (at least ten) of different compare processors for confirming or
denying correspondences among entities in each of the different kinds
of images and 3-D models.
Box 2.3
-----------------------------------------------------------------
| Some Basic Kinds of Vision Processors. |
| 0. The task processor. |
| 1. 3-D recognition. |
| 2. 3-D modeling routines. |
| 3. Reality simulator. |
| 4. Image analyser. |
| 5. Image synthesizer - hidden line eliminator. |
| 6. Locus solvers: camera, sun and object. |
| 7. Comparators: 2D and 3D. |
-----------------------------------------------------------------
2.2 Vision Tasks.
The 3-D vision research problem being discussed is that of
finding out how to write programs that can see in the real world.
Alternate related vision research problems include: modeling human
perception, solving visual puzzles (non-real world), and developing
advanced automation techniques (ad hoc vision). In order to approach
the problem, specific programming tasks are proposed and solutions
are sought; however, please distinguish the idea of a research problem
from that of a programming task. As will be illustrated, many vision
tasks can be done without vision. The vision solution to be found
should be able to deal with real images, should include the
continuity of the visual process in time and space, and should be
general purpose rather than ad hoc. These three requirements
(reality, continuity, generality) will be developed by surveying
six examples of computer vision tasks.
First, there is the robot chauffeur task. In 1969, John
McCarthy asked me to consider the vision requirements of a computer
controlled car such as he depicted in an essay [appendix]. The idea
is that a user of such an automatic car would request a destination;
the robot would select a route from an internally stored road map;
and it would then proceed to its destination using visual data. The
problem involves representing the road map in the computer and
establishing the correspondence between the map and the appearance of
the road as the automatic chauffeur drives the vehicle along the
selected route. Lacking a computer controlled car, the problem was
abstracted to that of tracing a route along the driveways and parking
lots that surround the Stanford A.I. Laboratory using a television
camera and transmitter mounted on a radio controlled electric cart.
The robot chauffeur task could be solved by non-visual means such as
by railroad like guidance or by inertial guidance; to preserve the
vision aspect of the problem, no particular artifacts should be
required along a route (landmarks must be found, not placed); and the
extent of inertial dead reckoning should be noted.
Second, there is the task of a robot explorer. In 1967,
McCarthy and Lederberg published a description of a robot for
exploring the surface of the planet Mars. The robot explorer was
required to run for long periods of time without human intervention
because the signal transmission time to Mars is as great as twenty
minutes and because the 24.6 hour Martian day would place the vehicle
out of Earth sight for twelve hours at a time. (This latter difficulty
could be avoided at the expense of having a set of communication
relay satellites in orbit around Mars). The task of the explorer
would be to drive around mapping the surface of Mars, looking for
interesting features, and doing various experiments. To be prudent,
a Mars explorer should be able to navigate without vision; this can
be done by driving slowly and by using a tactile collision and
crevasse detector. If the television system fails, the core samples
and so on can still be collected at different Martian sites without
unusual risk to the vehicle due to visual blindness.
The third vision task is that of the robot soldier, tank,
sentry, pilot or policeman. The problem has several forms which are
quite similar to the chauffeur and the explorer with the additional
goal of doing something nasty to an enemy. Although this vision task
has not yet been explicitly attempted at Stanford, to the best of my
knowledge, the reader should be warned that a thorough solution to
any of the other tasks almost assures the Orwellian technology to
solve this one.
Fourth, the turn table task is to construct a 3-D model from
a sequence of 2-D television images taken of an object rotated on a
turn table. The turntable task was selected as a simplification of
the explorer task and is an example of a nearly pure descriptive
vision task.
Fifth, the classic blocks vision task consists of two parts:
first convert a video image into a line drawing; second, make a
selection from a set of predefined prototype models of blocks that
accounts for the line drawing. In my opinion, this vision task
emphasizes three pitfalls: single image vision, line drawings and
blocks. The greatest pitfall, in the usual blocks vision task, is the
presumption that a single image is to be solved; thus diverting
attention away from the most important depth perception mechanism
which is parallax. The second pitfall is that the usual notion of a
perspective line drawing is not a natural intermediate state; but is
rather a very sophisticated and platonic geometric idea. The perfect
line drawing lacks photometric information; even a line drawing with
perfect shadow lines included will not resemble anything that can
readily be gotten by processing real television pictures. Curiously,
the lack of success in deriving line drawings from real television
images of real blocks has not dampened interest in solving the second
part of the problem. The perfect line drawing puzzle was first
worked on by Guzman and extended to perfect shadows by Waltz;
nevertheless, enough remains so that the puzzle will persist on its
own merits, without being closely relevant to real world computer
vision. Even assuming that imperfect line drawings are given, the
final unreality of the blocks themselves has seduced such
researchers as Falk and Grape to build byzantine systems of
vertex-edge classification which almost certainly can not be extended
beyond the blocks domain. Actually, the blocks would not be such a
bad research simplification, if researchers could avoid getting hung
up in the fact that they have edges and vertices, but concentrate
instead on where the blocks are and on how they scatter light. The
blocks task can be rehabilitated by requiring photometric modeling
and by requiring the use of multiple images for depth perception.
Sixth, the Stanford Hand Eye Project has recently dedicated
itself to solving the task of automatic machine assembly. In
particular, the group will try to develop techniques that will be
demonstrated by the fully automatic assembly of a chain saw gasoline
engine. The two pressing vision questions of machine assembly are
where is the part and where is the hole; these questions will be
initially handled by composing ad hoc part and hole detectors for
each vision step required for the assembly.
The point of this task survey was to sharpen our taste for
what is and is not a task requiring real 3-D vision; and to point out
that caution has to be taken to preserve the vision aspects of a
given task. In the usual course of vision projects, a single task or
a single tool unfortunately dominates the research; my work is no
exception, the one tool is 3-D modeling, and the task that dominated
the formative stages of the research is that of the robot chauffeur
cart. A better understanding of the ultimate nature of computer
vision can be obtained by keeping the several tasks and the several
tools in mind.
---------------------------------------------------------------------
BOX 2.4 TABLE OF 3-D COMPUTER VISION TASKS.
---------------------------------------------------------------------
1. The Robot Chauffeur. Cart Task.
Given a computer controlled cart and a road map,
drive the cart along a preselected route,
without crashing into anything.
2. The Robot Explorer. Cart Task.
Given a computer controlled cart,
explore and map the world,
without crashing into anything.
3. The Robot Soldier. Cart Task.
Given a computer controlled vehicle,
locate and destroy the enemy.
4. Turn Table Task.
The turn table task is to construct a 3-D model from a
sequence of 2-D television images taken of an object
rotated on a turn table.
5. The Blocks Task.
First, convert a video image into a line drawing;
Second, identify and locate the blocks in the line drawing.
6. Machine Assembly Tasks.
Where is the part ? Where is the hole ?
Location Task: Where is it.
Identification Task: What is it.
---------------------------------------------------------------------
2.3 The Nature of Images.
An image is a 2-D spatial data structure representing a
rectangle from the pattern of light formed by a thin lens. A sequence
of images in time is called a film. For the present design theory,
there are three basic kinds of information in an image: photometric,
geometric, and topological; also there are three kinds of 2-D
images: raster, contour, and mosaic.
Geometry fundamentally has to do with distance measure. The
geometry of an image is based on coordinate pairs that are associated
with the elements that form the image. From the coordinates such
geometric properties as length, area, angle and moments can be
computed. Photometry has to do with light measure. Although physical
measurements of light may include power, hue, saturation,
polarization and phase; only the relative power between points of the
same image is easily available to the computer using a television
camera. The acquisition of color images is possible at Stanford by
means of color filters; however, color does not significantly change
the vision problem at hand. Topology has to do with neighborhoods,
what is next to what. Topological data may be explicitly represented
by pointers between related entities, or implicitly represented by
the format of the data.
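As an illustration of computing geometric properties from coordinate pairs, the following Python sketch derives the perimeter, area and centroid (from the first moments) of a closed polygon given as a list of vertex coordinates. It is not taken from the original system; the function name polygon_properties and the choice of a polygon as the image element are assumptions made for the example.

```python
# Illustrative sketch only: geometric properties of a closed polygonal
# image region computed from its vertex coordinate pairs.
from math import hypot


def polygon_properties(vertices):
    """Return (perimeter, signed_area, centroid) of a simple polygon.

    vertices: list of (x, y) pairs in order around the boundary.
    The signed area is positive for counter-clockwise vertex order.
    """
    n = len(vertices)
    perimeter = 0.0
    area2 = 0.0        # twice the signed area (shoelace sum)
    mx = my = 0.0      # first moments, scaled by six
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]   # wrap around to close the polygon
        cross = x0 * y1 - x1 * y0
        perimeter += hypot(x1 - x0, y1 - y0)
        area2 += cross
        mx += (x0 + x1) * cross
        my += (y0 + y1) * cross
    area = area2 / 2.0
    centroid = (mx / (3.0 * area2), my / (3.0 * area2))
    return perimeter, area, centroid
```

For instance, polygon_properties([(0, 0), (4, 0), (4, 3), (0, 3)]) describes a 4 by 3 rectangle; angles and higher moments could be accumulated in the same loop.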
A raster image is a two dimensional integer valued array of
pixels. A pixel, "picture element", is a single sample position on
a vidicon. Although the real shape of a pixel is probably that of a
blunt ellipse; the fiction that pixels tessellate the image into
little rectangles will be adopted. For other theoretical purposes the
array is assumed to be formed by sampling and truncating a two
argument, smooth, infinitely differentiable real valued function; this
latter fiction will only occasionally be needed here.
Given a raster image, the classical approach is to find the
features. One feature, which rarely escapes the notice of even the
dullest student of the subject, is called the "edge". For a naive
start, an edge can be defined as a locus of change in the image
function; the locus set where edges are not found will be called
"regions". Edges and regions are two sides of the same slippery
thing, which even the sharpest students of vision have not yet fully
grasped. A sophisticated definition of the region/edge notion should
include an effective procedure for converting a raster approximation
of an image function into a region/edge representation. Two
region/edge structures of particular value are the contour map and
the mosaic.
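The naive definition of an edge as a locus of change can be made concrete with a small Python sketch. It marks a pixel as an edge when the raster value changes by more than a threshold toward the right or lower neighbor; everything unmarked is region. The function name edge_map, the threshold parameter and the choice of neighbors are assumptions for illustration, not the effective procedure the text calls for.

```python
# Illustrative sketch only: a naive edge/region split of a raster image.

def edge_map(raster, threshold):
    """Return a boolean array, True where the image function changes.

    raster: list of equal-length rows of integer pixel values.
    A pixel is an edge pixel when its value differs from its right or
    lower neighbor by more than threshold.
    """
    rows, cols = len(raster), len(raster[0])
    edges = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Differences are taken as zero past the image border.
            dx = abs(raster[r][c + 1] - raster[r][c]) if c + 1 < cols else 0
            dy = abs(raster[r + 1][c] - raster[r][c]) if r + 1 < rows else 0
            edges[r][c] = max(dx, dy) > threshold
    return edges
```

On real television rasters a fixed threshold of this kind is fragile, which is one reason the edge and region notions remain slippery.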
A contour image is like a geodetic contour map. In such
contour images no two contours ever cross and all contours close. A
mosaic image (or tessellation) is like a ceramic tile mosaic. In such
a mosaic image no two regions ever overlap and the whole image is
explicitly covered with tiles. Further useful restrictions might be
made concerning whether it is permitted to have tiles with holes in
them, and whether it is permitted for a tile to have points that are
thinner than a single pixel.
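One way to see the mosaic constraints (no overlap, full coverage) is to label a raster by flood fill. The Python sketch below assigns every pixel to exactly one tile, a tile being a 4-connected set of pixels of equal value. The function name mosaic_labels and the equal-value tile criterion are assumptions for illustration; a practical system would merge pixels under a looser similarity test.

```python
# Illustrative sketch only: a raster converted into a mosaic of tiles.
from collections import deque


def mosaic_labels(raster):
    """Return (labels, tile_count) for a raster image.

    labels has the same shape as raster; each entry is the index of the
    tile containing that pixel, so tiles cover the image and never overlap.
    """
    rows, cols = len(raster), len(raster[0])
    labels = [[-1] * cols for _ in range(rows)]
    tile_count = 0
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] != -1:
                continue                      # already inside some tile
            value = raster[r][c]
            labels[r][c] = tile_count
            queue = deque([(r, c)])
            while queue:                      # breadth-first flood fill
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and labels[ny][nx] == -1
                            and raster[ny][nx] == value):
                        labels[ny][nx] = tile_count
                        queue.append((ny, nx))
            tile_count += 1
    return labels, tile_count
```

This labeling permits tiles with holes and tiles one pixel wide; the further restrictions mentioned above would have to be enforced by a later pass.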
2.4 The Nature of Worlds.
The physical information most directly relevant to vision is
the location, extent and light scattering properties of solid opaque
objects; the location, orientation and scales of the camera that
takes the pictures; and the location and nature of the light that
illuminates the world. The transformation rules of the everyday
world that a programmer may assume, a priori, are the laws of
Newtonian physics. The arguments against geometric modeling divide
into two categories: the reasonable and the intuitive.
The reasonable arguments attack 3-D geometric modeling by
comparing it to another modeling alternative (some alternatives are
listed in the box immediately below). My overall view is that the domains
of greatest efficiency of the possible kinds of models do not
overlap; and that an artificial intellect will have some portion of
each kind. Nevertheless I feel that 3-D geometric modeling is
superior for the task at hand, and that the other models are less
relevant to vision.
---------------------------------------------------------------------
|BOX: Assumption: |
| |
| The visual world model should be a 3-D geometric model. |
| |
| Alternatives: |
| |
| 1. Image memory, with only the camera model in 3-D. |
| 2. Statistical world model, e.g. Duda & Hart. |
| 3. Procedural Knowledge, e.g. Hewitt & Winograd. |
| 4. Semantic knowledge, e.g. Wilks & Schank. |
| 5. Formal Logic models, e.g. McCarthy & Hayes. |
| 6. Syntactic models. |
---------------------------------------------------------------------
The best alternative to a 3-D geometric model is to have a
library of little 2-D images describing the appearance of various 3-D
loci from given directions. The advantage would be that a
sophisticated image predictor would not be required; on the other
hand, the image library is potentially quite large, and even with
a huge data base new views and lighting of familiar objects and
scenes can not be anticipated.
The statistical model is quite relevant to vision and can be
added to the geometric model. However, the statistical model can not
stand alone because the processes of occultation, rotation and
illumination make the approach infeasible.
Procedural knowledge models represent the world in terms of
routines (or actors) which either know or can compute the answer to a
question about the world. Semantic models represent the world in terms
of a data structure of conceptual statements; and formal logic models
represent the world in terms of first order predicate calculus or in
terms of a situation calculus. The procedural, semantic and formal
logic world models are of course all general enough to represent a
vision model and in a theoretical sense they are just other notations
for 3-D geometric modeling. However in practice, these three
modeling regimes are not efficient holders and handlers of
quantitative geometric data; but are rather intended for a higher
level of abstract reasoning. Another alleged advantage of these
higher models is that they can represent partial knowledge and
uncertainty, which in a geometric model is implicit, in that
structures are missing or incomplete. For example, McCarthy and
Feldman demand that when a robot has only seen the front of an office
desk that the model should be able to draw inferences about the back
of the desk; I feel that this so called advantage is not required by
the problem and that basic visual modeling is on a more agnostic
level.
The syntactical approach to descriptive vision is that an
image is a sentence of a picture grammar and that consequently the
image description should be given in terms of the sequence of grammar
transformation rules. Again this paradigm is theoretically true but
impractical for real images of 3-D objects because simple
replacement rules can not readily express rotation, perspective,
and photometric transformations. On the other hand, the syntactical
models have been of some use in describing 2-D shapes. [Gipps, 74;
Freeman 65].
The intuitive arguments include the opinions that geometric
modeling is too numerical, too exact, or too non-human to be relevant
for computer vision research. Since I suffer a lack of sympathy
for these positions, I will forsake any pretense of objectivity and
attack them as prejudice and fallacy (bearing in mind that I might
have to apologize and recant at some later date).
The natural mimicry fallacy is that it is false to insist
that a machine must mimic nature in order to achieve its design
goals. Boeing 747's are not covered with feathers; trucks do not have
legs; and computer vision need not simulate human vision. The
advocates of the uniqueness of natural intelligence and perception
will have to come up with a rather hairy proof to establish their
conjecture. In the meantime, one should be open minded about the
potential forms a perceptive consciousness can take.
The self introspection fallacy is that it is false to insist
that one's introspections about how he thinks and sees are direct
observations of thought and sight. By introspection some conclude
that the visual models (even on a low level) are essentially
qualitative rather than quantitative. My belief is that the vision
processing of the brain is quite quantitative and only passes into
qualities at a higher level of the process. In either case, the
details of human visual processing are inaccessible to conscious self
inspection.
Although I think that the above two fallacies of intuition
generate an anti numerical model prejudice, convincing a person of
these fallacies doesn't seem to remove his doubts. Some important
argument or idea is missing that would convince the so prejudiced
potential vision worker of the importance of numerical models prior
to the full achievement of computer vision (vice versa, I have not
heard an argument that would change my prejudice in favor of such
models). This matter of conflicting intuitions would not be
important, were it not that the "they" include so many of my
immediate colleagues. (On the other hand, I may well be proved wrong
when the first really powerful 3-D computer vision system is built
without using any geometric models worth speaking of).
2.5 Locus Solving.
The crux of computer vision is to deduce information about
the world being viewed from images of that world. Accordingly, three
main descriptive vision problems are camera solving, body solving and
sun solving.
---------------------------------------------------------
| Three kinds of Locus Solving: |
| |
| 1. Camera Locus Solving. |
| 2. Body Locus Solving. |
| 3. Sun Locus Solving. |
| |
---------------------------------------------------------
Camera solving is routinely done in two ways: in a single
image the image loci of a set of known world loci (perhaps points on
a calibration object) are measured and a camera model computed; or in
two images a set of corresponding landmark feature points are found
and the whole system is solved. To calibrate a camera with a single
image requires knowing something about the world a priori; calibrating
a camera with multiple images can solve for everything except the
"true" scale and origin of the world.
After the camera positions are known, the location and extent
of the objects composing the scene can be found using parallax.
Parallax is the principal means of depth perception and is the
alchemist that converts 2-D images into 3-D models.
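The role of parallax can be illustrated with the simplest two-camera case. The Python sketch below assumes two identical pinhole cameras with parallel optical axes, separated by a known baseline along the x axis; the depth of a matched feature then follows from the difference of its two image coordinates. The function names, the planar (X, Z) world and the camera arrangement are all assumptions for this example, not the method of the original system.

```python
# Illustrative sketch only: depth from parallax for two parallel
# pinhole cameras. Names and camera arrangement are assumptions.

def project_x(point, camera_x, focal):
    """Image x-coordinate of a world point (X, Z) seen by a camera
    located at (camera_x, 0) and looking along +Z."""
    x, z = point
    return focal * (x - camera_x) / z


def depth_from_parallax(x_left, x_right, baseline, focal):
    """Depth Z of a feature from its displacement between two images."""
    disparity = x_left - x_right
    if disparity <= 0.0:
        raise ValueError("no measurable parallax for this feature pair")
    return focal * baseline / disparity


def locate(point, baseline, focal=1.0):
    """Project a point into both cameras, then recover its depth."""
    x_left = project_x(point, 0.0, focal)
    x_right = project_x(point, baseline, focal)
    depth = depth_from_parallax(x_left, x_right, baseline, focal)
    return x_left, x_right, depth
```

Calling locate((1.0, 10.0), 2.0) and locate((3.0, 30.0), 6.0) yields the same pair of image coordinates, since the second world is the first scaled by three. The images alone therefore cannot fix the true scale of the world; the recovered depth is only as good as the assumed baseline.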
After the camera and object positions are known to some
accuracy, the nature and location of light sources can be deduced
from the shines and shadows. On the other hand, with outdoor
situations the sun position can be rather accurately predicted with a
simple ephemeris, which relieves the body solver of some of the
burden of filtering out photometric effects.
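A simple ephemeris of the kind alluded to might look like the following Python sketch. It uses a common textbook approximation (a cosine model of solar declination and the standard elevation formula), which is a stand-in and not necessarily the ephemeris the author had in mind. The function names are hypothetical, and local solar time is assumed as input, so longitude and equation-of-time corrections are left out.

```python
# Illustrative sketch only: approximate sun elevation from date and time.
from math import asin, cos, degrees, pi, radians, sin


def solar_declination(day_of_year):
    """Approximate solar declination in degrees (cosine model)."""
    return -23.44 * cos(2.0 * pi * (day_of_year + 10) / 365.0)


def solar_elevation(latitude_deg, day_of_year, solar_hour):
    """Sun elevation above the horizon in degrees.

    solar_hour is local solar time in hours, with 12.0 at solar noon.
    """
    decl = radians(solar_declination(day_of_year))
    lat = radians(latitude_deg)
    hour_angle = radians(15.0 * (solar_hour - 12.0))   # 15 degrees per hour
    s = sin(lat) * sin(decl) + cos(lat) * cos(decl) * cos(hour_angle)
    s = max(-1.0, min(1.0, s))                         # guard rounding error
    return degrees(asin(s))
```

With the elevation (and a similar formula for azimuth) predicted in advance, shadow directions become an input to the vision system rather than something to be solved for.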
2.6 Comparing.
The compare process is the keystone for completing the arch
of a visual feedback system; however the compare need not require
much study if the entities to be mated can be highly individualized
so that there is either a monogamous match or nothing. (Although the
study of sophisticated comparing is important, in a descriptive
approach to vision, when a simple compare doesn't work, one censures
the object representation rather than the match processor.)
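The "monogamous match or nothing" policy can be sketched in Python as a mutual best match test: a pairing is accepted only when each entity is the other's single closest candidate within a tolerance, and ambiguous cases are dropped rather than resolved. The function names, the tolerance parameter and the use of Euclidean distance on 2-D points are assumptions for illustration.

```python
# Illustrative sketch only: one-to-one matching with no second guesses.
from math import hypot


def euclidean(a, b):
    """Distance between two 2-D points."""
    return hypot(a[0] - b[0], a[1] - b[1])


def best_index(item, candidates, distance, tolerance):
    """Index of the unique closest candidate, or None if there is no
    candidate within tolerance or the closest two are tied."""
    scored = sorted((distance(item, c), j) for j, c in enumerate(candidates))
    if not scored or scored[0][0] > tolerance:
        return None
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None
    return scored[0][1]


def monogamous_matches(predicted, perceived, distance=euclidean, tolerance=2.0):
    """Return (i, j) pairs where predicted[i] and perceived[j] each pick
    the other as best match; everything else is left unmatched."""
    pairs = []
    for i, p in enumerate(predicted):
        j = best_index(p, perceived, distance, tolerance)
        if j is not None and best_index(perceived[j], predicted,
                                        distance, tolerance) == i:
            pairs.append((i, j))
    return pairs
```

When this simple compare fails to find pairs, the policy in the text is to enrich the entity descriptions so they become more individual, not to make the matcher cleverer.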
Three important compares can be characterized as verify
compare, reveal compare and recognize compare. The verify compare
involves finding the corresponding entities between a predicted image
and a perceived image for the sake of camera calibration and for the
sake of eliminating known landmark features from the revelation
process. The reveal compare involves finding the corresponding
entities in two perceived images, so that the location and extent of
new landmark objects can be solved. Finally, the recognition
compare involves mating a perceived entity with one of a set of known
model entities; recognition is done both in the 2-D image domain and
in the domain of 3-D object models, although the latter will receive
more attention.
Two problems of description comparing are normalization and
segmentation. Normalization involves eliminating irrelevant
differences such as location, orientation and lighting. Segmentation
involves subdividing a complex object into pieces, so that only the
simple small pieces need be matched.
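Normalization for location and scale can be sketched in a few lines of Python: the point set is translated so its centroid is at the origin and scaled to unit root-mean-square radius, after which two copies of the same shape at different places and sizes compare equal. The function name normalize_points is hypothetical, and orientation and lighting normalization are deliberately left out of this sketch.

```python
# Illustrative sketch only: removing location and scale from a point set.
from math import sqrt


def normalize_points(points):
    """Translate the centroid to the origin and scale to unit RMS radius."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centered = [(x - cx, y - cy) for x, y in points]
    rms = sqrt(sum(x * x + y * y for x, y in centered) / n)
    if rms == 0.0:
        return centered            # all points coincide; nothing to scale
    return [(x / rms, y / rms) for x, y in centered]
```

Orientation could be removed similarly by rotating the principal axis of the point set onto a fixed direction.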
2.7 Mobile Robot Vision.
The elements discussed so far will now be brought together
into a system design for performing mobile robot vision. The proposed
system is illustrated below in the block diagram. Although the robot
chauffeured cart was the main task theme of this research, I have
failed to date, March 1974, to achieve the hardware and software
required to drive the cart around the laboratory under its own
control. Nevertheless, this necessarily theoretical cart system has
been of considerable use in developing the visual 3-D modeling
routines and theory, which are the subject of this thesis.
FIGURE 2.2
---------------------------------------------------------------------
Cart Vision Mandala:
→→→→→→→→→→→→→→→→→→→ PERCEIVED →→→→→→ REALITY →→→→→→ PREDICTED →→→→
↑ WORLD SIMULATOR WORLD ↓
↑ ↓
↑ ↓
↑ PERCEIVED →→→→→→ CART →→→→→→→→ PREDICTED →→→↓
↑ CAMERA LOCUS DRIVER CAMERA LOCUS ↓
↑ ↑ ↓ ↓
↑ ↑ ↓ ↓
↑ ↑ THE CART PREDICTED→→→→↓
BODY CAMERA SUN LOCUS ↓
LOCUS LOCUS ↓
SOLVER SOLVER ↓
↑ ↑ ↓
↑ ↑ ↓
REVEAL VERIFY IMAGE
COMPARE COMPARE SYNTHESIZER
↑ ↑ ↑ ↑ ↓
↑ ↑ ↑ ↑ ↓
↑ ←← PERCEIVED→→→→→↑ ↑←←←←←←←←←←←←←←←←←←←← PREDICTED ←←←←←←←↓
←←←←← MOSAIC IMAGE MOSAIC IMAGE ↓
↑ ↑ ↓
↑ ↑ ↓
↑ ↑ ↓
PERCEIVED PREDICTED ↓
CONTOUR IMAGE CONTOUR IMAGE ↓
↑ ↑ ↓
↑ ↑ ↓
↑ ↑ ↓
PERCEIVED PREDICTED ←←←←←←←←←
VIDEO IMAGE VIDEO IMAGE
↑
↑
↑
TELEVISION
CAMERA
The robot chauffeur task involves establishing the
correspondence between an internal road map and the appearance of the
road in order to steer a vehicle along a predefined path. For a first
cut, the planned route is assumed to be clear, and the cart and the
sun are assumed to be the only movable things in a static world.
Dealing with moving obstacles is a second problem; motion thru a
static world must be dealt with first.
The cart at the Stanford Artificial Intelligence Laboratory
is intended for outdoors use and consists of a piece of plywood, four
bicycle wheels, six electric motors, two car batteries, a television
camera, a television transmitter, a box of digital logic, a box of
relays, and a toy airplane radio receiver. (The vehicle being
discussed is not "Shakey", which belongs to the Stanford Research
Institute's Artificial Intelligence Group. There are two A.I. labs
near Stanford and each has a computer controlled vehicle). The six
possible cart actions are: run forwards, run backwards, steer to the
left, steer to the right, pan camera to the left, pan camera to the
right. Other than the television camera, there is no telemetry
concerning the state of the cart or its immediate environment.
The solution to the cart problem begins with the cart at a
known starting position with a road map of visual landmarks with
known loci. That is, the upper leftmost two rectangles of the cart
mandala are initialized so that the perceived cart locus and the
perceived world correspond with reality. Flowing across the top of
the mandala, the cart driver blindly moves the cart forward along
the desired route by dead reckoning (say the cart moves five feet and
stops) and the driver updates the predicted cart locus. The reality
simulator is an identity in this simple case because the world is
assumed static. Next the image synthesizer uses the predicted world,
camera and sun to compute a predicted image containing the landmarks
features expected to be in view. Now, in the lower left of the
mandala, the cart's television camera takes a perceived picture and
(flowing upwards) the picture is converted into a form suitable for
comparing and matching with the predicted image. Features that are
both predicted and perceived and found to match are used by the
camera locus solver to compute a new perceived camera locus (from
which the cart locus can be deduced). Now the cart driver compares
the perceived and the predicted cart locus and corrects its course
and moves the cart again, and so on.
BOX 2.7
---------------------------------------------------------
| Chauffeur Cart Task Solution:                         |
|                                                       |
| 1. Predict (or retrieve) 2D image features.           |
| 2. Perceive (take) a television picture and convert.  |
| 3. Compare (verify) predicted and perceived features. |
| 4. Solve for camera locus.                            |
| 5. Servo the cart along its intended course.          |
---------------------------------------------------------
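The five steps of Box 2.7 can be illustrated by a small simulation.
This is only a sketch under strong assumptions, not the cart's actual
software: the world is flattened to two dimensions, the "image
features" are landmark positions in the cart's own frame rather than
television pictures, the camera is simulated without noise, and the
landmark names, the drift figures, and all function names are
hypothetical. The locus solver here is an ordinary least-squares rigid
fit of matched points.

```python
import math

# Hypothetical road map: landmark name -> known world locus (x, y), in feet.
LANDMARKS = {"tree": (10.0, 5.0), "post": (20.0, -4.0), "sign": (30.0, 6.0)}

def move(pose, turn, distance):
    """Turn, then roll forward; pose is (x, y, heading in radians)."""
    x, y, heading = pose
    heading += turn
    return (x + distance * math.cos(heading), y + distance * math.sin(heading), heading)

def features(pose):
    """Landmark positions in the cart's own frame (stand-in for 2D image features)."""
    x, y, heading = pose
    c, s = math.cos(heading), math.sin(heading)
    return {name: (c * (lx - x) + s * (ly - y), -s * (lx - x) + c * (ly - y))
            for name, (lx, ly) in LANDMARKS.items()}

def compare(predicted, perceived, tolerance=5.0):
    """Keep features that are both predicted and perceived and roughly agree."""
    return [n for n in predicted
            if n in perceived and math.dist(predicted[n], perceived[n]) <= tolerance]

def solve_locus(matched, perceived):
    """Least-squares rigid fit of perceived cart-frame points onto map loci."""
    k = len(matched)
    px = sum(perceived[n][0] for n in matched) / k
    py = sum(perceived[n][1] for n in matched) / k
    wx = sum(LANDMARKS[n][0] for n in matched) / k
    wy = sum(LANDMARKS[n][1] for n in matched) / k
    cross = dot = 0.0
    for n in matched:
        ax, ay = perceived[n][0] - px, perceived[n][1] - py
        bx, by = LANDMARKS[n][0] - wx, LANDMARKS[n][1] - wy
        cross += ax * by - ay * bx
        dot += ax * bx + ay * by
    heading = math.atan2(cross, dot)
    c, s = math.cos(heading), math.sin(heading)
    return (wx - (c * px - s * py), wy - (s * px + c * py), heading)

def drive(goal, steps=20, step_length=5.0):
    """Run the predict/perceive/compare/solve/servo cycle; returns (believed, true) poses."""
    believed = true = (0.0, 0.0, 0.0)
    for _ in range(steps):
        # 5. Servo: aim at the goal from the believed locus.
        bearing = math.atan2(goal[1] - believed[1], goal[0] - believed[0])
        turn = math.remainder(bearing - believed[2], 2 * math.pi)
        distance = min(step_length, math.dist(goal, believed[:2]))
        # The real cart drifts (made-up 5% slip and 0.05 radian steering bias) ...
        true = move(true, turn + 0.05, 0.95 * distance)
        # ... while the driver only knows the dead-reckoned prediction.
        predicted_pose = move(believed, turn, distance)
        predicted = features(predicted_pose)         # 1. predict
        perceived = features(true)                   # 2. perceive (simulated camera)
        matched = compare(predicted, perceived)      # 3. compare
        believed = (solve_locus(matched, perceived)  # 4. solve for the locus
                    if len(matched) >= 2 else predicted_pose)
    return believed, true
```

Calling drive((40.0, 0.0)) walks the simulated cart toward a goal
forty feet down the road. With fewer than two matched landmarks the
sketch simply keeps the dead-reckoned locus, since a single point
cannot fix the heading.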
The remaining limb of the cart mandala is invoked in order to
turn the chauffeur into an explorer. Perceived images are compared
thru time by the reveal compare and new features are located by the
body locus solver and placed into the world model.
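One simple way a new feature might be located from two views is by
intersecting sight lines. The sketch assumes a flat two dimensional
world, two already solved camera poses, and a measured bearing to the
same feature in each view; the function name and the pose convention
(x, y, heading) are hypothetical, and a real body locus solver would
also have to cope with noise and with more than two views.

```python
import math

def locate_feature(pose_a, bearing_a, pose_b, bearing_b):
    """Intersect two sight lines to place a feature in the world.

    Each pose is (x, y, heading); each bearing is measured from the heading.
    Returns (x, y), or None when the sight lines are parallel.
    """
    xa, ya, ha = pose_a
    xb, yb, hb = pose_b
    # Unit direction of each sight line in world coordinates.
    dax, day = math.cos(ha + bearing_a), math.sin(ha + bearing_a)
    dbx, dby = math.cos(hb + bearing_b), math.sin(hb + bearing_b)
    denom = dax * dby - day * dbx  # zero when the sight lines are parallel
    if abs(denom) < 1e-9:
        return None
    # Solve a + t*da = b + u*db for t, the distance along the first sight line.
    t = ((xb - xa) * dby - (yb - ya) * dbx) / denom
    return (xa + t * dax, ya + t * day)
```

The farther apart the two camera positions are relative to the
distance of the feature, the better conditioned the intersection.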
Now the generality and feasibility of such a cart system
depend almost entirely on the representation of the world and the
representation of image features. (The more general, the less
feasible.) Although the bulk of the rest of this document develops a
polyhedral representation for the sake of photometric generality,
four simpler cart systems could be realized by using simpler models.
A first system consists of a road map, a road model, a road
model generator, a solar ephemeris, an image predictor, an image
comparator, a camera locus solver, and a course servo routine. The
roadways and nearby environs are entered into the computer. In fact,
real roadways are constructed from two drawings: a two dimensional
X,Y alignment map, showing the way the center of the road goes as a
curve composed of line segments and circular arcs; and a two
dimensional S,Z elevation diagram, showing the height of the surface
above sea level as a function of distance along the road, as a
sequence of linear grades and vertical arcs which (not too
surprisingly) are nearly cubic splines. A second system is like the
first except that the road model, road model generator, and image
predictor are replaced by a library of road images. In this system
the robot vehicle is "trained" by being driven down the roads it is
supposed to follow. A third system is like the first except that the
road map is not initially given, and indeed the road is no longer
presumed to exist. Part of the problem becomes finding a road, a road
in the sense of a clear area; this version yields the cart explorer,
and if the clear area is found quite rapidly and the world is updated
quite frequently, the explorer can be a chauffeur that can handle
obstacles and moving objects. The fourth system is like the third,
except that the world is modeled by a single valued surface elevation
function, rather than by a polyhedral model.
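The road model of the first system can be sketched as a function from
distance S along the road to a point X,Y,Z on the centerline. This is
a minimal illustration with assumptions of my own choosing: the
alignment is a list of line and arc pieces starting at the origin and
heading along +X, a positive radius turns left, and the elevation
diagram is reduced to linear grades between stations, so the vertical
arcs are left out. The names road_point, alignment and profile are
hypothetical.

```python
import math
from bisect import bisect_right

def road_point(alignment, profile, s):
    """Return (x, y, z) at distance s along the road centerline.

    alignment: list of ("line", length) or ("arc", length, radius) pieces,
               starting at the origin heading along +X; radius > 0 turns left.
    profile:   list of (S, Z) stations, sorted by S, joined by linear grades.
    """
    x = y = heading = 0.0
    remaining = s
    for piece in alignment:
        d = min(remaining, piece[1])  # distance travelled on this piece
        if piece[0] == "line":
            x += d * math.cos(heading)
            y += d * math.sin(heading)
        else:
            r = piece[2]
            new_heading = heading + d / r
            # Closed-form advance along a circular arc of signed radius r.
            x += r * (math.sin(new_heading) - math.sin(heading))
            y -= r * (math.cos(new_heading) - math.cos(heading))
            heading = new_heading
        remaining -= d
        if remaining <= 0:
            break
    # Elevation: linear interpolation between the bracketing stations.
    stations = [p[0] for p in profile]
    i = min(max(bisect_right(stations, s), 1), len(profile) - 1)
    (s0, z0), (s1, z1) = profile[i - 1], profile[i]
    z = z0 + (z1 - z0) * (s - s0) / (s1 - s0)
    return (x, y, z)
```

For example, a 100 foot straight followed by a quarter circle of 50
foot radius would be written as
[("line", 100.0), ("arc", 25.0 * math.pi, 50.0)]. Sampling road_point
at regular values of S gives a polyline from which an image predictor
could work.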
2.8 Related Work.
Larry Roberts is justly credited with doing the seminal work
in 3-D Computer Vision; although his thesis appeared over ten years
ago, the subject has languished, dependent on and overshadowed by the
four areas called: Image Processing, Pattern Recognition, Computer
Graphics, and Artificial Intelligence. Outside the computer
sciences, workers in psychology, neurology and philosophy also seek a
theory of vision.
IMAGE PROCESSING involves the study and development of
programs that enhance, transform and compare 2D images. Nearly all
image processing work can eventually be applied to computer vision in
various circumstances. A good survey of this field can be found in an
article by Rosenfeld[69]. Image PATTERN RECOGNITION involves two
steps: feature extraction and classification. A comprehensive text
about this field with respect to computer vision has been written by
Duda and Hart[73]. COMPUTER GRAPHICS is the inverse of descriptive
computer vision. The problem of computer graphics is to synthesize
images from three dimensional models; the problem of descriptive
computer vision is to analyze images into three dimensional models.
An introductory textbook about this field would be that of Newman
and Sproull[73]. Finally, there is ARTIFICIAL INTELLIGENCE, which in
my opinion is an institution sheltering a heterogeneous group of
embryonic computer subjects; the biggest of the present day orphans
include: robotics, natural language, theorem proving, speech
analysis, vision and planning. A more narrow and relevant definition
of artificial intelligence is that it concerns the programming of the
robot task processor which sits above the vision system. There is no
general reference on Artificial Intelligence that I wish to
recommend.
The related vision work of specific individuals has already
been mentioned in context. To summarize, my vision work is related to:
Early: Roberts[63], Sutherland[63]; Stanford: Falk, Feldman and
Paul[67], Tenenbaum[72], Agin[72], Grape[73]; MIT: Guzman, Horn,
Waltz, Krakauer; Utah: Warnock, Watkins; other places: SRI and JPL.
Future progress in computer vision will proceed in step with
better computer hardware, better computer graphics software, and
better world modeling software. Future vision work at Stanford
which is related to the present theory will be done by Lynn Quam and
Hans Moravec. Similar work on vehicle vision is being done at JPL
and SRI.
The machine assembly task is being pursued both by the
Artificial Intelligence Group of the Stanford Research Institute and
by the Hand Eye Project at Stanford University. Because the demand
for doing practical vision tasks can be satisfied with existing ad
hoc methods or by not using a visual sensor at all, I expect little
or no vision progress per se from such research, although their
demonstrations should be robotic spectaculars.
Since the missing ingredient for computer vision is the
spatial modeling to which perceived images can be related, I believe
that the development of the technology for generating commercial film
and television by computer for entertainment will make a significant
contribution to computer vision.
2.9 Visual Consciousness.
"For the purpose of presenting my argument I must first
explain the basic premise of sorcery as don Juan presented it to me.
He said that for a sorcerer, the world of everyday life is not real,
or out there, as we believe it is. For a sorcerer, reality or the
world we all know, is only a description. For the sake of validating
this premise don Juan concentrated the best of his efforts into
leading me to a genuine conviction that what I held in mind as the
world at hand was merely a description of the world; a description
that had been pounded into me from the moment I was born."
- Carlos Castaneda. Journey to Ixtlan.
The larger context of a vision theory depends on one's
opinion about human consciousness. In my opinion, mind is a program
that is running in the brain. Now consider what software is needed to
account for consciousness, the private life of the self that burns
in our heads. The so called stream of consciousness consists of
little voices talking, fragments of music playing, and a color
visual display of the present place and moment. I believe that the
computation being performed by an intellectual entity in order to
stay visually conscious of its external world is a reality
simulation in sync with sensory perception. If the individual is
deprived of sensations, the simulator can go on without them,
providing the mind with dreams and hallucinations.
The basic inspiration for this idea is an analogy between 3-D
computer graphics and human vision. First consider computer
graphics: it is possible to program a computer to simulate the view
of a camera moving thru a simulated scene. Architects look at
simulated buildings, cartoonists look at simulated commercials, and
pilots look at simulated aircraft carriers. Second, the position of
the simulated camera can be controlled either by direct command or
indirectly by a further simulation, such as of an airplane. However,
in one unusual 3-D display system, at the University of Utah, the
simulated camera position is maintained in sync with the head/eye
position of the viewer.
Now consider human vision. You are where your eyes are. The
analogy is that the display simulator resembles the visual display
that goes on inside one's head. That is, the computer display is to
the whole man, as the visual consciousness is to the mind's eye. The
mind "watches" the visual consciousness "display" models in the short
term memory. Such a visual consciousness is one of a finite number
of nested perceptual systems which reduce the world until it is
simple enough to be handled directly by a set of goal seeking task
processors which comprise one's soul.
2.10 Summary.
To recapitulate, three vision system design requirements were
postulated: reality, generality, and continuity. These requirements
were illustrated by discussing a number of vision related tasks.
Next, a vision system was described as mediating between 2-D images
and a world model; with the world model being further broken down
into a 3-D geometric model and a task world model. Between these
entities three basic vision modes were identified: recognition,
verification and revelation (description). Finally, the general
purpose vision system was depicted as a quantitative and description
oriented feedback cycle which maintains a 3-D geometric model for the
sake of higher qualitative, symbolic, and recognition oriented task
processors.
Approaching the vision system in greater detail, the roles of
seven (or so) essential kinds of processors were explained: the task
processor, 3-D modeling routines, reality simulator, image
analyser, image synthesizer, comparators, and locus solvers. The
processors and data types were assembled into a cart chauffer system.
Computer vision is related to (if not contained in) image processing,
pattern recognition, computer graphics and artificial intelligence.